Ofir Nachum
Abstract:
We review basic concepts of convex duality and summarize how this duality may be applied to a variety of
reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and
discounted or undiscounted rewards. The derivations yield a number of intriguing results, including the ability to
perform policy evaluation and on-policy policy gradient estimation with behavior-agnostic offline data, as well as methods to learn a policy via max-likelihood optimization. These results encompass both new algorithms and new perspectives on existing ones. By providing a unified treatment of these results, we hope to enable researchers to better use and apply the tools of convex duality to make further progress in RL.
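For concreteness, a very general form of convex duality often used in such derivations is Fenchel-Rockafellar duality. A minimal statement, assuming $f$ and $g$ are proper, convex, lower semi-continuous functions, $A$ is a linear operator with adjoint $A^{*}$, and a standard constraint qualification holds (the notation is chosen here for illustration, not taken from the text above):
\[
  \min_{x} \; f(x) + g(Ax)
  \;=\;
  \max_{y} \; -f^{*}(-A^{*}y) - g^{*}(y),
\]
where $f^{*}(y) := \sup_{x} \langle x, y \rangle - f(x)$ denotes the convex (Fenchel) conjugate of $f$. Passing between such primal and dual problems is the basic mechanism by which the RL objectives discussed here can be recast.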